Abstract. In a recent paper [4], Duarte and Jutten investigated the Blind Source
Separation (BSS) problem for the nonlinear mixing model that they introduced in that
paper. They proposed to solve this problem with information-theoretic tools, more
precisely by minimizing the mutual information (MI) of the outputs of the separating
structure. When applying the MI approach to BSS problems, one usually determines the
analytical expressions of the derivatives of the MI with respect to the parameters of the
considered separating model. In the literature, these calculations have so far mainly been
reported for linear mixtures. They are more complex for nonlinear mixtures, because of
dependencies between the considered quantities. Moreover, the notations commonly
employed by the BSS community in such calculations may become misleading when they
are applied to nonlinear mixtures, because of these dependencies. We claim that
the calculations reported in [4] contain an error, because they did not take all these
dependencies into account. In this document, we therefore explain this phenomenon, by
showing the effect of indirect dependencies on the application of the MI approach to the
mixing and separating models considered in [4]. We thus introduce a corrected expression
of the gradient of the considered BSS criterion based on MI. This corrected gradient may
then be used, e.g., to optimize the adaptive coefficients of the considered separating system
by means of the well-known gradient descent algorithm. As explained hereafter, this
investigation has some similarities with an analysis that we previously reported in another
arXiv document [3]. However, these two investigations concern different problems, not
only in terms of the considered type of mixture and separating structure, but also of the
mathematical tools used to develop BSS methods for these configurations (information
theory vs maximum likelihood approach).
1 Data model



Blind source separation (BSS) consists in restoring a vector s(t) of N unknown source 
signals from a vector x(t) of P observed signals (most often with P = N), where x(t) is 
derived from s(t) through an unknown mixing function g, i.e. 

x(t)=g(s(t)). (1) 

Recently, Duarte and Jutten investigated a specific version of this problem [4], which
involves P = 2 observed signals $x_1(t)$ and $x_2(t)$, which are derived from N = 2 source
signals $s_1(t)$ and $s_2(t)$ through the nonlinear function defined as

$x_1(t) = s_1(t) + a_{12} (s_2(t))^{k}$ (2)
$x_2(t) = s_2(t) + a_{21} (s_1(t))^{1/k}$. (3)

This data model is derived from the Nikolsky-Eisenman empirical model for
potentiometric-based ion concentration sensors [1]. As in [1], we omit the time index
t in signal notations hereafter, for readability. The mixing model (2)-(3) may then also be
expressed in compact form as

x = g(s). (4) 

In this equation, $s = [s_1, s_2]^T$ and $x = [x_1, x_2]^T$, where $T$ stands for transpose, and
the nonlinear mixing function g has two components $g_1$ and $g_2$, with $x_i = g_i(s)$,
$\forall i \in \{1, 2\}$. These components $g_i$ are respectively defined by (2) and (3). Eq. (4)
focuses on the signals (i.e. sources and observations). It hides the fact that the observations
also depend on the parameters of the mixing model, i.e. on $a_{12}$ and $a_{21}$ in the model
considered here. This additional dependency can be made explicit, by rewriting (4) as

$x = g(s, a_{12}, a_{21})$. (5)
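
The explicit form (5) can be illustrated with a short numerical sketch. The code below is only an illustration under stated assumptions: the function name `mix`, the parameter values and the uniform positive sources are hypothetical choices, and the exponent of $s_1$ in (3) is read as 1/k, consistently with the derivatives used later in this document. Positive source values are assumed so that the fractional power is well defined.

```python
import numpy as np

def mix(s, a12, a21, k):
    """Mixing model (2)-(3): s has shape (2, T) and holds positive source samples."""
    s1, s2 = s
    x1 = s1 + a12 * s2 ** k          # Eq. (2)
    x2 = s2 + a21 * s1 ** (1.0 / k)  # Eq. (3), exponent read as 1/k
    return np.stack([x1, x2])

# Hypothetical example values, chosen only for illustration.
rng = np.random.default_rng(0)
s = rng.uniform(0.5, 1.5, size=(2, 5))
x = mix(s, a12=0.3, a21=0.2, k=2.0)
```

With k = 1 the same function reduces to a linear mixture with mixing matrix [[1, a12], [a21, 1]], which matches the remark made later about the linear case.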



2 Previously reported results for mutual information minimization

2.1 Overview and issue of previous method 

As suggested above, the BSS problem associated with the mixing model (2)-(3) consists
in retrieving a sequence of unknown source vectors s from the corresponding sequence of
measured observation vectors x and from the mixing parameters $a_{12}$ and $a_{21}$, which are
also initially unknown. These mixing parameters should therefore be estimated before
proceeding to the source restoration step. Creating an overall BSS method thus consists
in defining two items, i.e. i) a "separating structure", which performs the inversion of
the mixing equations (2)-(3) for known mixing parameter values, and ii) a procedure for
estimating these mixing parameters.

The separating structure used in [4] was derived by Duarte and Jutten from the
structure for linear-quadratic mixtures proposed by Hosseini and Deville in [5], [6], [?], [2]. The
structure in [4] belongs to the general class of structures proposed by Deville and Hosseini
in [2] for the ATM class of mixing models, which includes the specific model (2)-(3).

As for the estimation of the mixing parameters, Duarte and Jutten developed a
procedure based on information-theoretic tools, more precisely on the minimization of the
mutual information (MI) of the outputs of the separating structure. However, we here
claim that this procedure contains an error, which is due to a difficulty encountered with
nonlinear mixing models in general, for different classes of BSS methods. This difficulty
is somewhat similar to the one that we highlighted in another arXiv document [3]. Unlike
the method considered hereafter, the BSS approach described in [3] is not based on
information-theoretic tools, but on the maximum likelihood framework. Moreover, it
concerns a different class of nonlinear mixtures. However, similar quantities appear in the
calculations performed for both methods$^1$, and they deserve special care in both of them.

The current document therefore aims at explaining and correcting the error which was
made in [4]. We thus show how the BSS method of [4] should be modified so as to actually
achieve mutual information minimization. Before focusing on the issue faced in [4], we
now summarize the features of that approach which are of importance hereafter.

2.2 Description of previous method 

The considered separating structure has internal adaptive coefficients $w_{12}$ and $w_{21}$. For
each time t, this structure determines an output vector $y = [y_1, y_2]^T$ from its current
internal coefficients and from the current observation vector x. To this end, it iteratively
updates its output according to

$y_1(n+1) = x_1 - w_{12} (y_2(n))^{k}$ (6)
$y_2(n+1) = x_2 - w_{21} (y_1(n))^{1/k}$. (7)

The convergence of this recurrence therefore corresponds to a state such that

$y_1 = x_1 - w_{12} y_2^{k}$ (8)
$y_2 = x_2 - w_{21} y_1^{1/k}$. (9)
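
The recurrence (6)-(7) and its fixed point (8)-(9) can be sketched as follows. This is a minimal illustration, not the implementation used in [4]: the function name `separate`, the initialization y(0) = x, the stopping rule and the numerical values are assumptions made here, and positive signal values are assumed so that the power 1/k is well defined.

```python
def separate(x1, x2, w12, w21, k, n_iter=200, tol=1e-12):
    """Iterate (6)-(7) from the assumed start y(0) = x until the update is below tol."""
    y1, y2 = x1, x2
    for _ in range(n_iter):
        # Both outputs are updated from the previous iterate, as in (6)-(7).
        y1_new = x1 - w12 * y2 ** k
        y2_new = x2 - w21 * y1 ** (1.0 / k)
        converged = abs(y1_new - y1) + abs(y2_new - y2) < tol
        y1, y2 = y1_new, y2_new
        if converged:
            break
    return y1, y2

# Hypothetical example: one observation built from s = (1.0, 0.8) with (2)-(3).
k, a12, a21 = 2.0, 0.3, 0.2
s1, s2 = 1.0, 0.8
x1 = s1 + a12 * s2 ** k
x2 = s2 + a21 * s1 ** (1.0 / k)

# When the separating coefficients equal the mixing parameters,
# the fixed point (8)-(9) coincides with the sources.
y1, y2 = separate(x1, x2, w12=a12, w21=a21, k=k)
```

Convergence is not guaranteed for arbitrary coefficients; the values above were chosen so that the recurrence is contracting around the fixed point.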

For a given time t, we denote as $Y_1$ and $Y_2$ the random variables respectively
associated with the output signal samples $y_1$ and $y_2$ obtained after the above recurrence has
converged. We also define the corresponding output random vector as $Y = [Y_1, Y_2]^T$.

The optimum values of $w_{12}$ and $w_{21}$ are defined as those which minimize the mutual
information of $Y_1$ and $Y_2$, which is denoted I(Y). Equivalently, they are those which
minimize a quantity C(Y). This quantity is equal to I(Y), up to an additive term which
only depends on the observations and which therefore does not depend on $w_{12}$ and $w_{21}$.
That quantity reads

$C(Y) = \sum_{i=1}^{2} H(Y_i) - E\{\ln |J_h|\}$ (10)

where $H(Y_i)$ is the differential entropy of $Y_i$, while $E\{.\}$ stands for expectation and $J_h$ is
the Jacobian$^2$ of the separating function $h = g^{-1}$ achieved by the considered separating
structure, i.e. $J_h$ is the determinant of the Jacobian matrix of h. For the function h
considered in this investigation, the authors show that

$J_h = \frac{1}{1 - w_{12} w_{21} y_1^{1/k-1} y_2^{k-1}}$. (11)

$^1$ The quantities to be respectively considered in these two methods depend on different signals (source
signals vs outputs of separating system) and functions (mixing function vs separating function). However,
these signals and functions yield similar phenomena concerning the topic addressed in this document.

$^2$ For the sake of readability, we use the same notation, i.e. $J_h$, for (i) the sample value of this Jacobian
associated with sample values $y_1$ and $y_2$ (see e.g. (11)) and (ii) the random variable defined by this quantity
when considered as a function of the random variables $Y_1$ and $Y_2$ (see e.g. (10)). To know whether we are
considering the sample value of $J_h$ or the associated random variable in an equation, one just has to check
whether that equation involves the sample values $y_1$ and $y_2$ or the associated random variables $Y_1$ and $Y_2$:
see e.g. (11) and (12).
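
As a sanity check of the closed form (11), the sketch below compares it with the determinant of the Jacobian matrix of h estimated by central finite differences on the converged outputs. It is only an illustration: the helper names, the fixed number of iterations, the initialization y(0) = x and the numerical values are assumptions, not taken from [4].

```python
def separate(x1, x2, w12, w21, k, n_iter=100):
    """Run the recurrence (6)-(7) from the assumed start y(0) = x."""
    y1, y2 = x1, x2
    for _ in range(n_iter):
        y1, y2 = x1 - w12 * y2 ** k, x2 - w21 * y1 ** (1.0 / k)
    return y1, y2

def jacobian_closed_form(y1, y2, w12, w21, k):
    """J_h as given by (11)."""
    return 1.0 / (1.0 - w12 * w21 * y1 ** (1.0 / k - 1.0) * y2 ** (k - 1.0))

def jacobian_finite_diff(x1, x2, w12, w21, k, eps=1e-6):
    """det(dy/dx), with each column estimated by a central difference on h."""
    yp = separate(x1 + eps, x2, w12, w21, k)
    ym = separate(x1 - eps, x2, w12, w21, k)
    dy1_dx1, dy2_dx1 = (yp[0] - ym[0]) / (2 * eps), (yp[1] - ym[1]) / (2 * eps)
    yp = separate(x1, x2 + eps, w12, w21, k)
    ym = separate(x1, x2 - eps, w12, w21, k)
    dy1_dx2, dy2_dx2 = (yp[0] - ym[0]) / (2 * eps), (yp[1] - ym[1]) / (2 * eps)
    return dy1_dx1 * dy2_dx2 - dy1_dx2 * dy2_dx1

# Hypothetical operating point (positive signals, contracting recurrence).
w12, w21, k = 0.3, 0.2, 2.0
x1, x2 = 1.192, 1.0
y1, y2 = separate(x1, x2, w12, w21, k)
J_formula = jacobian_closed_form(y1, y2, w12, w21, k)
J_numeric = jacobian_finite_diff(x1, x2, w12, w21, k)
```

The two values should agree up to the finite-difference error, which supports reading (11) without an absolute value at operating points where the denominator is positive.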

To determine the values of $w_{12}$ and $w_{21}$ which minimize C(Y), the authors then consider
the gradient of C(Y) with respect to the vector composed of $w_{12}$ and $w_{21}$. Each component
of this gradient is equal to the derivative of C(Y) with respect to one of the parameters
$w_{k\ell}$. In [4], the authors denoted this gradient by using the notation most often employed
in the BSS community (see e.g. [7]), i.e. each of its components reads
$\frac{\partial C(Y)}{\partial w_{k\ell}}$. We keep this
notation in this section, in order to clearly refer to the equations available in [4], but in
Section 3 we will show that it may be misleading and we will therefore introduce another
notation. So, in [4], it was shown that these derivatives read

$\frac{\partial C(Y)}{\partial w_{k\ell}} = \sum_{i=1}^{2} E\{\psi_i(Y_i) \frac{\partial Y_i}{\partial w_{k\ell}}\} - E\{\frac{1}{J_h} \frac{\partial J_h}{\partial w_{k\ell}}\}$ (12)

where

$\psi_i(u) = - \frac{d \ln f_{Y_i}(u)}{du}, \quad \forall i \in \{1, 2\}$ (13)

are the score functions of the output signals, denoting $f_{Y_i}(.)$ the probability density
functions of these signals.

The last stage of this investigation consists in deriving the expressions of all the terms of
the right-hand side of (12). In Equation (26) of [4], an explicit expression is provided and
it is stated that it is equal to (the vector form of) the term
$E\{\frac{1}{J_h} \frac{\partial J_h}{\partial w_{k\ell}}\}$ which appears in
(12). We claim that this is not true, because the expression whose expectation is provided
in the right-hand side of Equation (26) of [4] is only one of the terms which compose
the complete expression to be then used in (12) as the term misleadingly denoted
$\frac{1}{J_h} \frac{\partial J_h}{\partial w_{k\ell}}$ in (12). In the following section of the
current document, we clarify this point and we determine the complete expression of the term denoted
$\frac{1}{J_h} \frac{\partial J_h}{\partial w_{k\ell}}$ in (12). We also comment
on the other terms of (12).

3 New results for mutual information minimization: corrected expression of gradient

When determining the values of $w_{12}$ and $w_{21}$ which minimize C(Y), that function C(Y)
is considered for the fixed set of observed vectors. The only independent variable in this
approach is the set of parameters to be estimated, i.e. $w_{12}$ and $w_{21}$. The outputs $y_1$ and
$y_2$ of the separating system are dependent variables, here linked to the observations and to
$w_{12}$ and $w_{21}$ by (8)-(9). The overall variations of C(Y) with respect to $w_{12}$ and $w_{21}$ result
from two types of terms contained in the expression of C(Y), i.e. (i) the terms involving
$w_{12}$ and $w_{21}$ themselves and (ii) the terms involving the output random variables $Y_1$ and
$Y_2$, which are here considered as functions of $w_{12}$ and $w_{21}$ and which may therefore be
denoted as $Y_1(w_{12}, w_{21})$ and $Y_2(w_{12}, w_{21})$ for the sake of clarity.

This approach should be kept in mind when interpreting all equations in [4], which
were partly gathered in Section 2 of the current document. Especially, the
function C(Y) itself, which appears in the left-hand side of (10), may be denoted as
$C(w_{12}, w_{21}, Y_1(w_{12}, w_{21}), Y_2(w_{12}, w_{21}))$ for the sake of clarity. In order to determine the
location of the minimum of this function, one should then consider the total derivatives of
$C(w_{12}, w_{21}, Y_1(w_{12}, w_{21}), Y_2(w_{12}, w_{21}))$ with respect to $w_{12}$ and $w_{21}$. The notations with
partial derivatives in (12) may therefore be misleading, as confirmed below. Therefore,
(12) should preferably be rewritten as$^3$

$\frac{dC(Y)}{dw_{k\ell}} = \sum_{i=1}^{2} E\{\psi_i(Y_i) \frac{dY_i}{dw_{k\ell}}\} - E\{\frac{1}{J_h} \frac{dJ_h}{dw_{k\ell}}\}$ (14)

still with (13). The term $\frac{1}{J_h} \frac{dJ_h}{dw_{k\ell}}$ in (14) then deserves some care because, as shown by (11),
the Jacobian contains the above-defined two types of dependencies with respect to $w_{12}$
and $w_{21}$, i.e. (i) direct dependencies due to the factors in (11) which explicitly contain $w_{12}$
and $w_{21}$ and (ii) indirect dependencies due to the factors in (11) which depend on $y_1$ and
$y_2$, which themselves depend on $w_{12}$ and $w_{21}$ in this approach. We here have to consider
the total derivative $\frac{dJ_h}{dw_{k\ell}}$, which takes into account both types of dependencies, and which
therefore reads

$\frac{dJ_h}{dw_{k\ell}} = \frac{\partial J_h}{\partial w_{k\ell}} + \sum_{i=1}^{2} \frac{\partial J_h}{\partial y_i} \frac{dy_i}{dw_{k\ell}}$. (15)

In this expression, $\frac{\partial J_h}{\partial w_{k\ell}}$ is the partial derivative of $J_h$ with respect to $w_{k\ell}$, calculated by
considering that the signals $y_1$ and $y_2$ are constant (in addition to the fact that the other
internal coefficient $w_{\ell k}$ of the separating system is also constant). This partial derivative
is the quantity that is taken into account in the right-hand side of (26) of [4]. However, let
us stress again that this partial derivative must first be added to the other terms in the
right-hand side of (15), in order to obtain the overall total derivative $\frac{dJ_h}{dw_{k\ell}}$ defined by (15).
What should eventually be used in the last term of (12) or (14) is this total derivative.

So, starting from the expression of $J_h$ provided in (11), one easily derives all its partial
derivatives involved in (15). They read as follows

$\frac{\partial J_h}{\partial w_{12}} = \frac{w_{21} y_1^{1/k-1} y_2^{k-1}}{[1 - w_{12} w_{21} y_1^{1/k-1} y_2^{k-1}]^2}$ (16)

$\frac{\partial J_h}{\partial w_{21}} = \frac{w_{12} y_1^{1/k-1} y_2^{k-1}}{[1 - w_{12} w_{21} y_1^{1/k-1} y_2^{k-1}]^2}$ (17)

$\frac{\partial J_h}{\partial y_1} = \frac{w_{12} w_{21} (\frac{1}{k} - 1) y_1^{1/k-2} y_2^{k-1}}{[1 - w_{12} w_{21} y_1^{1/k-1} y_2^{k-1}]^2}$ (18)

$\frac{\partial J_h}{\partial y_2} = \frac{w_{12} w_{21} y_1^{1/k-1} (k - 1) y_2^{k-2}}{[1 - w_{12} w_{21} y_1^{1/k-1} y_2^{k-1}]^2}$. (19)

$^3$ Each derivative $\frac{dC(Y)}{dw_{k\ell}}$ is "total" only with respect to the considered coefficient $w_{k\ell}$ (which is one of
the two coefficients $w_{12}$ and $w_{21}$), i.e. it takes into account all variations of C(Y) with respect to that
coefficient $w_{k\ell}$ while the other coefficient, i.e. $w_{\ell k}$, is kept constant. For the sake of clarity, we could
therefore denote that derivative $(\frac{dC(Y)}{dw_{k\ell}})_{w_{\ell k}}$, to show that $w_{\ell k}$ is constant. However, this would decrease
readability. Therefore, in all this paper we omit the notation $(.)_{w_{\ell k}}$, but it should be kept in mind that each
considered derivative with respect to $w_{k\ell}$ is calculated with $w_{\ell k}$ constant. Then, in this framework, what
we have to distinguish are: (i) the total derivative due to the variations of $w_{k\ell}$, $Y_1$ and $Y_2$ and (ii) the partial
derivative only due to $w_{k\ell}$. We then have to use two different notations for these two types of derivatives,
such as $\frac{dJ_h}{dw_{k\ell}}$ and $\frac{\partial J_h}{\partial w_{k\ell}}$ in (15). This type of notation is commonly used in the literature for functions
which depend (i) on a single independent variable, i.e. time, and (ii) on other variables which themselves
depend on time, such as coordinate variables: see e.g. http://en.wikipedia.org/wiki/Total_derivative . We
here extend this concept to a configuration which involves several independent variables, i.e. $w_{12}$ and $w_{21}$
(and, again, other variables which themselves depend on the independent variables, i.e. $Y_1$ and $Y_2$). We
keep the same type of notations as in the standard case involving a single independent variable.

The case when k = 1 deserves a comment. As shown by (2)-(3), the mixing model then
becomes linear. Besides, as shown by (18)-(19), we then have

$\frac{\partial J_h}{\partial y_1} = 0$ (20)
$\frac{\partial J_h}{\partial y_2} = 0$, (21)

so that the total derivative $\frac{dJ_h}{dw_{k\ell}}$ in (15) becomes equal to the partial derivative $\frac{\partial J_h}{\partial w_{k\ell}}$ in (15).
This clearly shows that the problems due to the distinction between these two derivatives,
that we address in this paper, concern nonlinear mixtures.
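
The difference between the two derivatives can be observed numerically. The sketch below is an illustration under stated assumptions (hypothetical helper names and parameter values, recurrence started at y(0) = x, positive signals): it estimates the total derivative of $J_h$ with respect to $w_{12}$ by a central finite difference taken along the converged outputs, for fixed observations, and subtracts the partial derivative (16).

```python
def separate(x1, x2, w12, w21, k, n_iter=100):
    """Run the recurrence (6)-(7) from the assumed start y(0) = x."""
    y1, y2 = x1, x2
    for _ in range(n_iter):
        y1, y2 = x1 - w12 * y2 ** k, x2 - w21 * y1 ** (1.0 / k)
    return y1, y2

def jacobian(y1, y2, w12, w21, k):
    """J_h as given by (11)."""
    return 1.0 / (1.0 - w12 * w21 * y1 ** (1.0 / k - 1.0) * y2 ** (k - 1.0))

def partial_dJ_dw12(y1, y2, w12, w21, k):
    """Partial derivative (16): y1, y2 and w21 are held constant."""
    p = y1 ** (1.0 / k - 1.0) * y2 ** (k - 1.0)
    return w21 * p / (1.0 - w12 * w21 * p) ** 2

def total_dJ_dw12(x1, x2, w12, w21, k, eps=1e-6):
    """Total derivative: the outputs are recomputed for each perturbed w12."""
    def J(w):
        y1, y2 = separate(x1, x2, w, w21, k)
        return jacobian(y1, y2, w, w21, k)
    return (J(w12 + eps) - J(w12 - eps)) / (2.0 * eps)

def gap(k, x1=1.192, x2=1.0, w12=0.3, w21=0.2):
    """Total minus partial derivative of J_h at a hypothetical operating point."""
    y1, y2 = separate(x1, x2, w12, w21, k)
    return total_dJ_dw12(x1, x2, w12, w21, k) - partial_dJ_dw12(y1, y2, w12, w21, k)

gap_nonlinear = gap(2.0)  # k = 2: indirect dependencies contribute
gap_linear = gap(1.0)     # k = 1: linear mixture, both derivatives coincide
```

For k = 1 the gap is zero up to finite-difference error, whereas for k = 2 it is clearly nonzero, in line with the remark above.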

The last terms which are required to obtain the complete expressions in (14)$^4$ and (15)
are all four derivatives $\frac{dy_i}{dw_{k\ell}}$. For the sake of clarity, we now show how they may be
considered, when taking into account the above comments about total and partial derivatives.
Here again, $w_{12}$ and $w_{21}$ should be considered as the independent variables, while $y_1$ and
$y_2$ are functions of them and the observations are constant. All these parameters are linked
by (8)-(9). By first computing the total derivatives of the latter equations with respect to
$w_{12}$, one gets

$\frac{dy_1}{dw_{12}} = -( y_2^{k} + w_{12} k y_2^{k-1} \frac{dy_2}{dw_{12}} )$ (22)

$\frac{dy_2}{dw_{12}} = - w_{21} \frac{1}{k} y_1^{1/k-1} \frac{dy_1}{dw_{12}}$. (23)

Inserting (23) in (22), one derives

$\frac{dy_1}{dw_{12}} = \frac{- y_2^{k}}{1 - w_{12} w_{21} y_1^{1/k-1} y_2^{k-1}}$. (24)

Then inserting (24) in (23), one obtains

$\frac{dy_2}{dw_{12}} = \frac{w_{21} \frac{1}{k} y_1^{1/k-1} y_2^{k}}{1 - w_{12} w_{21} y_1^{1/k-1} y_2^{k-1}}$. (25)

Similarly, computing the total derivatives of (8)-(9) with respect to $w_{21}$ eventually yields

$\frac{dy_1}{dw_{21}} = \frac{w_{12} k y_1^{1/k} y_2^{k-1}}{1 - w_{12} w_{21} y_1^{1/k-1} y_2^{k-1}}$ (26)

$\frac{dy_2}{dw_{21}} = \frac{- y_1^{1/k}}{1 - w_{12} w_{21} y_1^{1/k-1} y_2^{k-1}}$. (27)



$^4$ Eq. (14) is obtained by taking the derivative of (10) with respect to $w_{k\ell}$. It thus relies on the fact
that $\frac{dH(Y_i)}{dw_{k\ell}} = E\{\psi_i(Y_i) \frac{dY_i}{dw_{k\ell}}\}$. In [4], this result was borrowed from [8]. Considering the problems due
to indirect dependencies in nonlinear mixtures found in [1], one may wonder whether the relationship
$\frac{dH(Y_i)}{dw_{k\ell}} = E\{\psi_i(Y_i) \frac{dY_i}{dw_{k\ell}}\}$ still holds for the nonlinear mixing model studied in [4]. We claim that it does
hold.
The expressions of all four derivatives obtained with this approach remain equal to
the expressions (30)-(33) of [4], except that all partial derivative notations in [4] are
here replaced by total derivative notations $\frac{dy_i}{dw_{k\ell}}$.
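
The four total derivatives of the outputs, written here as $-y_2^k / D$, $w_{21} (1/k) y_1^{1/k-1} y_2^k / D$, $w_{12} k y_1^{1/k} y_2^{k-1} / D$ and $-y_1^{1/k} / D$ with $D = 1 - w_{12} w_{21} y_1^{1/k-1} y_2^{k-1}$, can be checked against finite differences of the converged outputs. The sketch below is a hedged illustration: helper names, the initialization y(0) = x and the operating point are assumptions.

```python
import numpy as np

def separate(x1, x2, w12, w21, k, n_iter=100):
    """Run the recurrence (6)-(7) from the assumed start y(0) = x."""
    y1, y2 = x1, x2
    for _ in range(n_iter):
        y1, y2 = x1 - w12 * y2 ** k, x2 - w21 * y1 ** (1.0 / k)
    return y1, y2

def dy_dw_closed_form(y1, y2, w12, w21, k):
    """Closed-form total derivatives; rows: y1, y2; columns: w12, w21."""
    D = 1.0 - w12 * w21 * y1 ** (1.0 / k - 1.0) * y2 ** (k - 1.0)
    dy1_dw12 = -y2 ** k / D
    dy2_dw12 = (w21 / k) * y1 ** (1.0 / k - 1.0) * y2 ** k / D
    dy1_dw21 = w12 * k * y1 ** (1.0 / k) * y2 ** (k - 1.0) / D
    dy2_dw21 = -y1 ** (1.0 / k) / D
    return np.array([[dy1_dw12, dy1_dw21], [dy2_dw12, dy2_dw21]])

def dy_dw_numeric(x1, x2, w12, w21, k, eps=1e-6):
    """Central finite differences of the converged outputs, observations fixed."""
    col_w12 = (np.array(separate(x1, x2, w12 + eps, w21, k))
               - np.array(separate(x1, x2, w12 - eps, w21, k))) / (2.0 * eps)
    col_w21 = (np.array(separate(x1, x2, w12, w21 + eps, k))
               - np.array(separate(x1, x2, w12, w21 - eps, k))) / (2.0 * eps)
    return np.column_stack([col_w12, col_w21])

# Hypothetical operating point (positive signals, contracting recurrence).
x1, x2, w12, w21, k = 1.192, 1.0, 0.3, 0.2, 2.0
y1, y2 = separate(x1, x2, w12, w21, k)
closed = dy_dw_closed_form(y1, y2, w12, w21, k)
numeric = dy_dw_numeric(x1, x2, w12, w21, k)
```

Agreement between `closed` and `numeric` only confirms the formulas at this operating point; it is not a proof.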

Gathering all above expressions then makes it possible to determine the total derivative
in (15), and then the overall gradient components in (14). This yields the correct
expression of the gradient of the considered BSS criterion based on mutual information.
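
A sample-average version of this gathering step may be sketched as follows. It is an illustration only, under several assumptions that are not taken from [4]: expectations are replaced by sample means, the score functions are supplied by the caller (here a hypothetical placeholder `psi`, the score of a zero-mean unit-variance Gaussian, instead of estimated scores), the recurrence starts at y(0) = x, and all names and numerical values are hypothetical.

```python
import numpy as np

def separate(x, w12, w21, k, n_iter=100):
    """Run the recurrence (6)-(7) on all samples; x has shape (2, T)."""
    y1, y2 = x[0], x[1]
    for _ in range(n_iter):
        y1, y2 = x[0] - w12 * y2 ** k, x[1] - w21 * y1 ** (1.0 / k)
    return y1, y2

def gradient_estimate(x, w12, w21, k, psi1, psi2):
    """Sample-mean version of the gradient, using the total derivative of J_h."""
    y1, y2 = separate(x, w12, w21, k)
    p = y1 ** (1.0 / k - 1.0) * y2 ** (k - 1.0)
    D = 1.0 - w12 * w21 * p  # J_h = 1 / D
    # Total derivatives of the outputs; index 0: w.r.t. w12, index 1: w.r.t. w21.
    dy1 = np.stack([-y2 ** k / D, w12 * k * y1 ** (1.0 / k) * y2 ** (k - 1.0) / D])
    dy2 = np.stack([(w21 / k) * y1 ** (1.0 / k - 1.0) * y2 ** k / D, -y1 ** (1.0 / k) / D])
    # Partial derivatives of J_h with respect to w12, w21, y1 and y2.
    dJ_dw = np.stack([w21 * p / D ** 2, w12 * p / D ** 2])
    dJ_dy1 = w12 * w21 * (1.0 / k - 1.0) * (p / y1) / D ** 2
    dJ_dy2 = w12 * w21 * (k - 1.0) * (p / y2) / D ** 2
    # Total derivative of J_h: direct term plus indirect terms through y1 and y2.
    dJ_total = dJ_dw + dJ_dy1 * dy1 + dJ_dy2 * dy2
    score_term = psi1(y1) * dy1 + psi2(y2) * dy2
    # (1 / J_h) * dJ_h/dw equals D * dJ_total.
    return np.mean(score_term - D * dJ_total, axis=1)

# Hypothetical data generated with the mixing model (2)-(3).
rng = np.random.default_rng(0)
k = 2.0
s = rng.uniform(0.5, 1.5, size=(2, 1000))
x = np.stack([s[0] + 0.3 * s[1] ** k, s[1] + 0.2 * s[0] ** (1.0 / k)])

psi = lambda u: u  # placeholder score function, not an estimated one
w12, w21 = 0.25, 0.15
grad = gradient_estimate(x, w12, w21, k, psi, psi)
```

With the placeholder score, `grad` is the exact gradient of the surrogate criterion mean(y1^2/2 + y2^2/2) - mean(ln|J_h|), which gives a simple way to validate the assembly by finite differences before plugging in estimated score functions.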

This correct gradient expression may eventually be used to optimize the adaptive
coefficients $w_{12}$ and $w_{21}$, e.g. using the well-known gradient descent algorithm.